One of the key operational challenges in bike-share systems is maintaining a balanced distribution of bikes across stations and times to meet user demand. Some stations frequently experience shortages, where users cannot find a bike when they need one, while others frequently experience surpluses, where users cannot find an open dock to return a bike. These rebalancing issues become especially serious during peak hours because of asymmetric commuting patterns. Without timely rebalancing, users may struggle to rent or return bikes promptly, leading to dissatisfaction and reduced ridership. Efficient rebalancing ahead of user demand keeps the system available and reliable, and it depends on real-time data analysis and prediction.
To address this issue, we propose a data-driven approach to predict bike demand for each station and time interval. By analyzing historical data, we can identify patterns and trends in bike usage, allowing us to make informed decisions about where and when to redistribute bikes. This approach not only improves the efficiency of the bike-share system but also enhances the overall user experience by ensuring that bikes are available when and where they are needed. Predictions will be made several hours in advance using models that incorporate features such as time of day, day of week, historical usage patterns, weather conditions, and current station capacity. By leveraging these data sources, we can optimize the distribution of bikes across the system, reducing the likelihood of shortages and surpluses.
The model focuses rebalancing efforts on stations in Philadelphia, PA. It is trained on historical bike-share data for May 2022 and used to predict bike demand for the next 24 hours; the predictions then inform rebalancing efforts.
The bike-share data is obtained from Indego, Philadelphia's bike-share system. It includes information on individual trips, including start and end times, start and end stations, and bike IDs. The data is cleaned and preprocessed to remove duplicates and missing values, then aggregated to hourly intervals to create a time series of bike demand for each station. May 2022 was selected as the temporal frame for this analysis.
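The code throughout assumes several packages are loaded. The setup chunk is not shown in this excerpt; a typical one, reconstructed from the functions used below, would be:

```r
library(tidyverse)   # dplyr, tidyr, purrr, ggplot2
library(lubridate)   # date-time parsing and floor_date()
library(sf)          # spatial joins and geometry handling
library(tidycensus)  # get_acs() for census data
library(riem)        # riem_measures() for airport weather
library(gridExtra)   # grid.arrange()
library(viridis)     # scale_colour_viridis()
library(kableExtra)  # styled HTML tables
```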
bike<-read.csv("data/indego-trips-2022-q2.csv")
may <- bike %>%
mutate(start_time = mdy_hm(start_time)) %>%
filter(month(start_time) == 5) %>%
mutate(interval60 = floor_date(ymd_hms(start_time), unit = "hour"),
interval15 = floor_date(ymd_hms(start_time), unit = "15 mins"),
week = week(interval60),
dotw = wday(interval60, label=TRUE))
may <- may %>%
filter(!is.na(start_time))

At the same time, census tract geometries for Philadelphia County were acquired from the US Census Bureau to provide geographic zone boundaries for further analysis.
philCensus <-
get_acs(geography = "tract",
variables = c("B01003_001", "B19013_001",
"B02001_002", "B08013_001",
"B08012_001", "B08301_001",
"B08301_010", "B01002_001"),
year = 2022,
state = "PA",
geometry = TRUE,
county=c("Philadelphia"),
output = "wide") %>%
rename(Total_Pop = B01003_001E,
Med_Inc = B19013_001E,
Med_Age = B01002_001E,
White_Pop = B02001_002E,
Travel_Time = B08013_001E,
Num_Commuters = B08012_001E,
Means_of_Transport = B08301_001E,
Total_Public_Trans = B08301_010E) %>%
select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
Means_of_Transport, Total_Public_Trans,
Med_Age,
GEOID, geometry) %>%
mutate(Percent_White = White_Pop / Total_Pop,
Mean_Commute_Time = Travel_Time / Total_Public_Trans,
Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)
philTracts <-
philCensus %>%
as.data.frame() %>%
distinct(GEOID, .keep_all = TRUE) %>%
select(GEOID, geometry) %>%
st_sf()

Since the raw bike-share data is a non-spatial layer containing only the
latitude and longitude of the start and end stations, we join it with the
census tract data to obtain the tract for each trip. The st_join function
from the sf package performs a spatial join between the bike-share data
and the census tracts. We are only interested in the start location of
each ride, because we want to predict the capacity of each station when a
user needs a bike to start a ride. This requires us to first filter out
rows with missing latitude and longitude.
dat_census <- st_join(may %>%
filter(is.na(start_lon) == FALSE &
is.na(start_lat) == FALSE &
is.na(end_lon) == FALSE &
is.na(end_lat) == FALSE) %>%
st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326),
philTracts %>%
st_transform(crs=4326),
join=st_intersects,
left = TRUE) %>%
rename(Origin.Tract = GEOID) %>%
mutate(start_lon = unlist(map(geometry, 1)),
start_lat = unlist(map(geometry, 2)))%>%
as.data.frame() %>%
select(-geometry)%>%
st_as_sf(., coords = c("end_lon", "end_lat"), crs = 4326) %>%
st_join(., philTracts %>%
st_transform(crs=4326),
join=st_intersects,
left = TRUE) %>%
rename(Destination.Tract = GEOID) %>%
mutate(to_longitude = unlist(map(geometry, 1)),
to_latitude = unlist(map(geometry, 2)))%>%
as.data.frame() %>%
select(-geometry)

The following map shows all the starting points of rides in Philadelphia during May 2022.
ggplot()+
geom_sf(data = philTracts %>%
st_transform(crs=4326), fill="white")+
geom_point(data = may, aes(x=start_lon, y = start_lat),
color = "#6E0E0A", alpha = 1, size = 0.3) +
ylim(min(may$start_lat), max(may$start_lat))+
xlim(min(may$start_lon), max(may$start_lon)) +
labs(title = "Location of Rides Starting Points in Philadelphia")+
mapTheme
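The map above (and later plots) reference mapTheme, plotTheme, and palette5, which are not defined in this excerpt. A minimal sketch of assumed definitions, not necessarily the authors' exact objects:

```r
# Assumed helper themes and palette; any similar ggplot2 themes would work.
mapTheme <- theme_void() +
  theme(plot.title = element_text(size = 12, face = "bold"))

plotTheme <- theme_minimal() +
  theme(plot.title = element_text(size = 12, face = "bold"))

# Five-color palette extending the four-color palette used later
palette5 <- c("#124E78", "#0B6E4F", "#F0F0C9", "#F2BB05", "#6E0E0A")
```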
### Weather data

The weather data is obtained from the National Weather Service station at Philadelphia International Airport. It includes information on temperature, precipitation, and wind speed. The data is cleaned and preprocessed to remove duplicates and missing values, then aggregated to hourly intervals to create a time series of weather conditions for the study period.
weather.Panel <-
riem_measures(station = "PHL", date_start = "2022-05-01", date_end = "2022-05-31") %>%
dplyr::select(valid, tmpf, p01i, sknt)%>%
replace(is.na(.), 0) %>%
mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
mutate(week = week(interval60),
dotw = wday(interval60, label=TRUE)) %>%
group_by(interval60) %>%
summarize(Temperature = max(tmpf),
Precipitation = sum(p01i),
Wind_Speed = max(sknt)) %>%
mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))

Starting with temporal analysis, we first visualize bike-share trips per hour in Philadelphia in May 2022. There is a clear daily cycle, with distinct peak and off-peak hours within each day. There is also a weekly cycle: five higher weekday peaks are typically followed by two lower weekend peaks. These correspond to the trends mentioned in the introduction, where bike share is in much higher demand during weekday rush hours than on weekends and late at night.
ggplot(dat_census %>%
group_by(interval60) %>%
tally())+
geom_line(aes(x = interval60, y = n))+
labs(title="Bike share trips per hr. Philadelphia, May, 2022",
x="Date",
y="Number of trips")+
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=8),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8))
Then, we visualize the distribution of trips by station at each hour in
May. The distribution is right-skewed: most stations have zero or very
few trips in any given hour, while a few stations have a very high number
of trips in particular hours.
ggplot(dat_census %>%
group_by(interval60, start_station) %>%
tally())+
geom_histogram(aes(n), binwidth = 5, fill="#6E0E0A")+
labs(title="Bike share trips per hour by station. Philadelphia, May, 2022",
x="Trip Counts",
y="Number of Stations")+
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=8),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8))

We perform feature engineering using the lubridate package to extract
temporal information for each trip. Usage was aggregated into 60-minute
and 15-minute intervals based on each ride's start time. Each ride was
then categorized into "Overnight", "AM Rush", "Mid-Day", or "PM Rush"
based on the hour of the day, and into weekday or weekend based on the
day of the week.
We visualize the distribution of the mean number of trips at each station during different time intervals of the day. During the PM rush, the average number of trips is highest, followed by mid-day and the AM rush. The overnight period has the lowest average, with only a few stations averaging 3 or more trips.
palette4 <- c("#124E78","#F0F0C9","#F2BB05","#6E0E0A")
dat_census %>%
mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
group_by(interval60, start_station, time_of_day) %>%
tally()%>%
group_by(start_station, time_of_day)%>%
summarize(mean_trips = mean(n))%>%
ggplot()+
geom_histogram(aes(mean_trips, fill=time_of_day), binwidth = 1)+
scale_fill_manual(values = palette4, name="Time of Day") +
labs(title="Mean Number of Hourly Trips Per Station. Philadelphia, May, 2022",
x="Number of trips",
y="Frequency")+
facet_wrap(~time_of_day)+
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=8),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8))

Next, we visualize the distribution of trips by day of the week. Monday through Thursday show similar temporal patterns, with the highest trip counts by time of day. Friday and Saturday show patterns similar to each other, with fewer people using the system. Sunday shows a peak pattern unlike any other day. In general, though, the lowest trip counts occur in the early morning and the highest in the late afternoon.
ggplot(dat_census %>% mutate(hour = hour(start_time)))+
geom_freqpoly(aes(hour, color = dotw), binwidth = 3,lwd=0.8)+
labs(title="Bike share trips in Philadelphia, by day of the week, May, 2022",
x="Hour",
y="Trip Counts")+
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=8),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8))
We also compared usage between weekdays and weekends. The chart shows
that weekday usage is much higher than weekend usage, with a smaller peak
during the AM rush and an even bigger peak during the PM rush. Weekend
usage peaks gently in the late morning and early afternoon, then declines
gradually through the rest of the day.
ggplot(dat_census %>%
mutate(hour = hour(start_time),
weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")))+
geom_freqpoly(aes(hour, color = weekend), binwidth = 1)+
labs(title="Bike share trips in Philadelphia - weekend vs weekday, May, 2022",
x="Hour",
y="Trip Counts")+
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=8),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8))

We visualize the distribution of trip counts at each station during different time intervals on weekdays and weekends. The map below indicates that weekday usage is much higher than weekend usage. On both weekdays and weekends, most trip start stations cluster around Center City Philadelphia, with a few trips starting in West and South Philadelphia. This distribution suggests that trip demand is especially high in the central city and near major transportation facilities, and that rebalancing is likely most needed in these areas.
ggplot()+
geom_sf(data = philTracts %>%
st_transform(crs=4326))+
geom_point(data = dat_census %>%
mutate(hour = hour(start_time),
weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
group_by(start_station, start_lat, start_lon, weekend, time_of_day) %>%
tally(),
aes(x=start_lon, y = start_lat, color = n),
fill = "transparent", alpha = 0.4, size = 3)+
scale_colour_viridis(direction = -1,
discrete = FALSE, option = "D")+
ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
xlim(min(dat_census$start_lon), max(dat_census$start_lon))+
facet_grid(weekend ~ time_of_day)+
labs(title="Bike share trips per hr by station. Philadelphia, May, 2022")+
mapTheme

May in Philadelphia was characterized by steadily warming spring days, with morning lows in the 50s °F climbing to highs in the mid-80s by month's end and regular daily swings of 15–20 °F. Precipitation was mostly light, with a few intense downpours at the end of the month. Winds generally hovered around 5–15 mph. These brief wet and blustery interludes punctuated otherwise mild, comfortable weather throughout May.
grid.arrange(
ggplot(weather.Panel, aes(interval60,Precipitation)) + geom_line() +
labs(title="Precipitation", x="Hour", y="Precipitation") + theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=8),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8)),
ggplot(weather.Panel, aes(interval60,Wind_Speed)) + geom_line() +
labs(title="Wind Speed", x="Hour", y="Wind Speed") +theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=8),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8)),
ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line() +
labs(title="Temperature", x="Hour", y="Temperature") +theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=8),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8)),
top="Weather Data - Philadelphia (PHL) - May, 2022")

We created a study panel in which each row is a unique combination of space and time: the rides at a particular station during a particular hour. We then enrich this panel by counting the number of rides at that station in that hour, adding weather information, and bringing in census data.
study.panel <-
expand.grid(interval60=unique(dat_census$interval60),
start_station = unique(dat_census$start_station)) %>%
left_join(., dat_census %>%
select(start_station, Origin.Tract, start_lon, start_lat )%>%
distinct() %>%
group_by(start_station) %>%
slice(1))
ride.panel <-
dat_census %>%
mutate(Trip_Counter = 1) %>%
right_join(study.panel) %>%
group_by(interval60, start_station, Origin.Tract, start_lon, start_lat) %>%
summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
left_join(weather.Panel) %>%
ungroup() %>%
filter(is.na(start_station) == FALSE) %>%
mutate(week = week(interval60),
dotw = wday(interval60, label = TRUE)) %>%
filter(is.na(Origin.Tract) == FALSE)
ride.panel <-
left_join(ride.panel, philCensus %>%
as.data.frame() %>%
select(-geometry), by = c("Origin.Tract" = "GEOID"))
ride.panel <-
ride.panel %>%
arrange(start_station, interval60) %>%
mutate(lagHour = dplyr::lag(Trip_Count,1),
lag2Hours = dplyr::lag(Trip_Count,2),
lag3Hours = dplyr::lag(Trip_Count,3),
lag4Hours = dplyr::lag(Trip_Count,4),
lag12Hours = dplyr::lag(Trip_Count,12),
lag1day = dplyr::lag(Trip_Count,24),
holiday = ifelse(yday(interval60) == 148,1,0)) %>%
mutate(day = yday(interval60)) %>%
mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
dplyr::lag(holiday, 2) == 1 ~ "PlusTwoDays",
dplyr::lag(holiday, 3) == 1 ~ "PlusThreeDays",
dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",
dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"),
holidayLag = ifelse(is.na(holidayLag), 0, holidayLag))

With the study panel in place, we perform some exploratory analysis to prepare for the regression models. First, we examine the relationship between trip counts and weather conditions. The figure below shows a generally strong, positive linear relationship between temperature and trip counts: as temperature increases, bike-share demand tends to increase as well.
ride.panel %>%
group_by(interval60) %>%
summarize(Trip_Count = mean(Trip_Count),
Temperature = first(Temperature)) %>%
mutate(week = week(interval60)) %>%
ggplot(aes(Temperature, Trip_Count)) +
geom_point(color = "black") + geom_smooth(method = "lm", se= FALSE, color="#124E78") +
facet_wrap(~week, ncol=8) +
labs(title="Trip Count As a Function of Temperature by Week",
x="Temperature", y="Mean Trip Count") +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8))

We also examine the relationship between trip counts and wind speed. The figure below shows a generally negative linear relationship between wind speed and trip counts in the first two weeks, but a positive one in the last two weeks. This inconsistent relationship may indicate that wind speed is not, on its own, a strong predictor of ridership.
ride.panel %>%
group_by(interval60) %>%
summarize(Trip_Count = mean(Trip_Count),
Wind = first(Wind_Speed)) %>%
mutate(week = week(interval60)) %>%
ggplot(aes(Wind, Trip_Count)) +
geom_point(color = "black") + geom_smooth(method = "lm", se= FALSE, color="#124E78") +
facet_wrap(~week, ncol=8) +
labs(title="Trip Count As a Function of Wind by Week",
x="Wind Speed", y="Mean Trip Count") +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8))

If trip counts exhibit temporal autocorrelation, incorporating lagged features should improve prediction accuracy. Trip volumes at a given hour are often strongly correlated with those in the preceding and following hours, so knowing the count at one time point makes it easier to estimate volumes in adjacent intervals. To capture this effect, we engineered six distinct hourly lag features.
The table and charts below show the Pearson correlations between current trip counts and six lagged variables. Correlation strength falls off as lag increases: immediate (1 h) and daily (24 h) lags show the strongest positive links, mid-range lags weaken steadily, and the 12-hour lag even flips to a negative relationship. These results show that the best predictors are the hour just before and the same hour on the previous day, while hours farther away are less helpful—and the hour 12 hours earlier can even show an opposite trend.
library(dplyr)
library(tidyr)
library(kableExtra)
# compute mean-by-hour, gather lags, then corr by lag variable
corr_df <- as.data.frame(ride.panel) %>%
group_by(interval60) %>%
summarise_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
gather(Variable, Value, -interval60, -Trip_Count) %>%
mutate(Variable = factor(
Variable,
levels = c("lagHour","lag2Hours","lag3Hours","lag4Hours","lag12Hours","lag1day")
)) %>%
group_by(Variable) %>%
summarise(Correlation = round(cor(Value, Trip_Count), 2)) %>%
ungroup()
# render with kableExtra
corr_df %>%
kable(
format = "html",
col.names = c("Lag Variable", "Correlation with Trip Count"),
caption = "Correlation Between Lagged Trip Counts and Current Trip Count"
) %>%
kable_styling(
full_width = FALSE,
position = "center",
bootstrap_options = c("striped", "hover", "condensed")
)

| Lag Variable | Correlation with Trip Count |
|---|---|
| lagHour | 0.90 |
| lag2Hours | 0.73 |
| lag3Hours | 0.53 |
| lag4Hours | 0.35 |
| lag12Hours | -0.52 |
| lag1day | 0.80 |
as.data.frame(ride.panel) %>%
group_by(interval60) %>%
summarise_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
gather(Variable, Value, -interval60, -Trip_Count) %>%
ggplot(aes(x = Value , y = Trip_Count)) +
geom_point(size = 1, color = "black") + geom_smooth(method = "lm", se= FALSE, color="#124E78") +
facet_wrap(~Variable, ncol = 6)+
labs(x = "Lagged Trip Count", y = "Trip Count", title = "Correlation Between Serial Lag Trips and Trip Count") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8))

We split the data into training and test sets: the training set includes weeks 20 to 22, while the test set includes weeks 18 to 19. Splitting by time preserves the natural temporal order and prevents any leakage of information from one period into another, ensuring that validation genuinely reflects the model's ability to generalize to unseen time intervals.
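The split itself, and the model_pred helper used when generating week_predictions, are not shown in this excerpt. A minimal sketch consistent with the week ranges described above:

```r
# Assumed split: train on weeks 20-22, hold out weeks 18-19
ride.Train <- filter(ride.panel, week >= 20)
ride.Test  <- filter(ride.panel, week < 20)

# Nest the test set by week so each week can be predicted separately
ride.Test.weekNest <-
  ride.Test %>%
  nest(data = -week)

# Helper that predicts Trip_Count for one nested week of data
model_pred <- function(dat, fit){
  predict(fit, newdata = dat)
}
```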
reg1 <-
lm(Trip_Count ~ factor(hour(interval60)) + factor(dotw) + Temperature, data=ride.Train)
reg2 <-
lm(Trip_Count ~ start_station + factor(dotw)+ Temperature, data=ride.Train)
reg3 <-
lm(Trip_Count ~ start_station + factor(hour(interval60)) + factor(dotw) + Temperature + Precipitation,
data=ride.Train)
reg4 <-
lm(Trip_Count ~ start_station + factor(hour(interval60)) + factor(dotw) + Temperature + Precipitation +
lagHour + lag2Hours +lag3Hours + lag12Hours + lag1day,
data=ride.Train)
reg5 <-
lm(Trip_Count ~ start_station + factor(hour(interval60)) + factor(dotw) + Temperature + Precipitation +
lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holidayLag + holiday,
data=ride.Train)

week_predictions <-
ride.Test.weekNest %>%
mutate(ATime_FE = map(.x = data, fit = reg1, .f = model_pred),
BSpace_FE = map(.x = data, fit = reg2, .f = model_pred),
CTime_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
DTime_Space_FE_timeLags = map(.x = data, fit = reg4, .f = model_pred),
ETime_Space_FE_timeLags_holidayLags = map(.x = data, fit = reg5, .f = model_pred)) %>%
gather(Regression, Prediction, -data, -week) %>%
mutate(Observed = map(data, pull, Trip_Count),
Absolute_Error = map2(Observed, Prediction, ~ abs(.x - .y)),
MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))

week_predictions %>%
select(week, Regression, MAE, sd_AE) %>%
arrange(week, Regression) %>%
kable(
format = "html",
digits = 2,
col.names = c("Week", "Model", "Mean Absolute Error (MAE)", "SD of Absolute Error"),
caption = "Weekly MAE and Standard Deviation of Absolute Errors by Regression Model"
) %>%
kable_styling(
full_width = FALSE,
position = "center",
bootstrap_options = c("striped", "hover", "condensed")
) %>%
collapse_rows(
columns = 1,
valign = "top",
row_group_label_position = "stack"
)

| Week | Model | Mean Absolute Error (MAE) | SD of Absolute Error |
|---|---|---|---|
| 18 | ATime_FE | 0.73 | 0.83 |
| 18 | BSpace_FE | 0.74 | 0.89 |
| 18 | CTime_Space_FE | 0.73 | 0.82 |
| 18 | DTime_Space_FE_timeLags | 0.64 | 0.75 |
| 18 | ETime_Space_FE_timeLags_holidayLags | 0.64 | 0.75 |
| 19 | ATime_FE | 0.71 | 0.83 |
| 19 | BSpace_FE | 0.70 | 0.90 |
| 19 | CTime_Space_FE | 0.72 | 0.82 |
| 19 | DTime_Space_FE_timeLags | 0.61 | 0.75 |
| 19 | ETime_Space_FE_timeLags_holidayLags | 0.61 | 0.75 |
week_predictions %>%
dplyr::select(week, Regression, MAE) %>%
gather(Variable, MAE, -Regression, -week) %>%
ggplot(aes(week, MAE)) +
geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
scale_fill_manual(values = palette5) +
labs(title = "Mean Absolute Errors by model specification and week") +
plotTheme

week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station = map(data, pull, start_station)) %>%
dplyr::select(interval60, start_station, Observed, Prediction, Regression) %>%
unnest() %>%
gather(Variable, Value, -Regression, -interval60, -start_station) %>%
group_by(Regression, Variable, interval60) %>%
summarize(Value = sum(Value)) %>%
ggplot(aes(interval60, Value, colour=Variable)) +
geom_line(size = 1.1) +
facet_wrap(~Regression, ncol=1) +
labs(title = "Predicted/Observed bike share time series", subtitle = "Philadelphia; A test set of 2 weeks", x = "Hour", y= "Station Trips") +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=8),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8))

week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station = map(data, pull, start_station),
start_lat = map(data, pull, start_lat),
start_lon = map(data, pull, start_lon)) %>%
select(interval60, start_station, start_lon, start_lat, Observed, Prediction, Regression) %>%
unnest() %>%
filter(Regression == "ETime_Space_FE_timeLags_holidayLags") %>%
group_by(start_station, start_lon, start_lat) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
geom_sf(data = philCensus, color = "grey", fill = "transparent")+
geom_point(aes(x = start_lon, y = start_lat, color = MAE),
fill = "transparent", alpha = 0.7)+
scale_colour_viridis(direction = -1,
discrete = FALSE, option = "D")+
ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
xlim(min(dat_census$start_lon), max(dat_census$start_lon))+
labs(title="Mean Abs Error, Test Set, Model 5")+
mapTheme

week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station = map(data, pull, start_station),
start_lat = map(data, pull, start_lat),
start_lon = map(data, pull, start_lon),
dotw = map(data, pull, dotw)) %>%
select(interval60, start_station, start_lon,
start_lat, Observed, Prediction, Regression,
dotw) %>%
unnest() %>%
filter(Regression == "ETime_Space_FE_timeLags_holidayLags")%>%
mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
ggplot()+
geom_point(aes(x= Observed, y = Prediction))+
geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "red")+
geom_abline(slope = 1, intercept = 0)+
facet_grid(time_of_day~weekend)+
labs(title="Observed vs Predicted",
x="Observed trips",
y="Predicted trips")+
plotTheme

week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station = map(data, pull, start_station),
start_lat = map(data, pull, start_lat),
start_lon = map(data, pull, start_lon),
dotw = map(data, pull, dotw) ) %>%
select(interval60, start_station, start_lon,
start_lat, Observed, Prediction, Regression,
dotw) %>%
unnest() %>%
filter(Regression == "ETime_Space_FE_timeLags_holidayLags")%>%
mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
group_by(start_station, weekend, time_of_day, start_lon, start_lat) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
geom_sf(data = philCensus, color = "grey", fill = "transparent")+
geom_point(aes(x = start_lon, y = start_lat, color = MAE),
fill = "transparent", size = 3, alpha = 0.4)+
scale_colour_viridis(direction = -1,
discrete = FALSE, option = "D")+
ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
xlim(min(dat_census$start_lon), max(dat_census$start_lon))+
facet_grid(weekend~time_of_day)+
labs(title="Mean Absolute Errors, Test Set")+
mapTheme

week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station = map(data, pull, start_station),
start_lat = map(data, pull, start_lat),
start_lon = map(data, pull, start_lon),
dotw = map(data, pull, dotw),
Percent_Taking_Public_Trans = map(data, pull, Percent_Taking_Public_Trans),
Med_Inc = map(data, pull, Med_Inc),
Percent_White = map(data, pull, Percent_White)) %>%
select(interval60, start_station, start_lon,
start_lat, Observed, Prediction, Regression,
dotw, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
unnest() %>%
filter(Regression == "ETime_Space_FE_timeLags_holidayLags")%>%
mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
filter(time_of_day == "AM Rush") %>%
group_by(start_station, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
gather(-start_station, -MAE, key = "variable", value = "value")%>%
ggplot(.)+
geom_point(aes(x = value, y = MAE), alpha = 0.4)+
geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE)+
facet_wrap(~variable, scales = "free")+
labs(title="Errors as a function of socio-economic variables",
y="Mean Absolute Error (Trips)")+
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=8),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8))